Cost Analysis of Classification Using CL and MM

Authors

  • László Kovács
  • Péter Barabás
Abstract

Computational linguistics (CL) covers the statistical and logical modeling of languages using computer-based software and hardware tools. An important component of CL systems is the morphological parser. The scope of our study is to build a statistical method for learning the rules of word inflection. The only requirement on the language is that its words are sequences of characters. A key factor for the required clustering algorithm is cost efficiency. After an analysis of the alternatives, two methods were selected for further refinement and adaptation: the observable Markov model method and the formal concept analysis method.

1 Classification Problem in CL

Computational linguistics (CL) covers the statistical and logical modeling of languages using computer-based software and hardware tools. An important component of CL systems is the morphological parser. A morpheme is the minimal unit of meaning in a language. The key morpheme of a concept is the stem; all transformations are defined on stems. The stem determines the base concept, while the application context of the concept is given by affixes, which add meaning of various kinds. Depending on its location, an affix is called a prefix, suffix, infix or circumfix [1]. If the application of an affix results in a new concept, the transformation is called derivation; if the output belongs to the same concept family, it is called inflection. In agglutinative languages, inflection is more complex: a stem can be extended with ten or more affixes. The problem with living languages is that the dictionary describing the rules and exceptions is huge and not static, and building and updating such dictionaries is an expensive process. Our goal is to investigate automated dictionary generation. The scope of our study is to build a statistical method for learning the rules of word inflection.
The only requirement on the language is that its words are sequences of characters. In our language model, words are built up from characters. During inflection, a word is mapped to a word (a different word or the same one). The initial word is called the stem, and the grammar rule describes the generation of the inflected form from the stem. It can be assumed that the transformation rules depend on the stem form of the concepts. There are relatively few distinct rules in a language, i.e. the number of rules is much smaller than the number of stems. Thus, in our approach, the rule assignment task is treated as a classification problem: each stem is assigned to a rule class. The set of possible rule classes is extracted from a training data set. The investigation is based on the following items:
• W: the set of words
• T = {(w, w') | w, w' ∈ W}: a training set, where w is a stem and w' is its transformed form
• R = {r | r: W → W}: the set of transformation rules, where r(w) is the transformed form of a stem w
• G: W → R: the grammar, treated as a rule classification function
The task of rule learning consists of the following sub-problems:
• extraction of the transformation rules from the training set
• determination of the association between a stem and its rule class
One of the main requirements on the method is efficiency. As the set of words W can be very large, the learning algorithm should have a low-degree polynomial cost function. In our approach, the low cost is achieved through the following steps:
• providing an efficient generalization
• allowing approximation in the rule system
For this purpose, statistical language processing is an appropriate solution. Considering the classification methods used in data mining, the following alternatives can be used: decision trees, Bayes methods and ANN methods.
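The rule-extraction sub-problem can be illustrated with a small sketch (the function names are our own, not the paper's): a transformation rule r is represented as a suffix pair (x, y), meaning "replace the trailing x of the stem with y".

```python
# Hypothetical sketch of the rule-extraction sub-problem: a rule r is
# stored as a suffix pair (x, y), i.e. "replace trailing x with y".
def extract_rule(stem, inflected):
    """Derive (x, y) from a training pair (w, w') in T."""
    i = 0  # length of the longest common prefix of the two forms
    while i < min(len(stem), len(inflected)) and stem[i] == inflected[i]:
        i += 1
    return stem[i:], inflected[i:]

def apply_rule(stem, rule):
    """r(w): replace the trailing x of the stem with y."""
    x, y = rule
    return stem[:len(stem) - len(x)] + y

rule = extract_rule("happy", "happier")   # ('y', 'ier')
print(apply_rule("easy", rule))           # easier
```

Stems sharing the same (x, y) pair then fall into the same rule class, which is the classification target described below.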
On the other hand, our problem area has some special characteristics that are not supported by the classical classification methods:
• a large number of attributes
• an incremental learning method
• support of early approximation
The main statistical methods for building morphological parsers are Hidden Markov Models (HMM), N-gram models (NGM) and Finite State Transducers (FST). The drawback of these algorithms is that they are not intended to perform classification; they are usually used to learn the direct transformation steps. Thus new algorithms for learning the rule classes had to be developed and tested. After an analysis of the alternatives, two methods were selected for further refinement and adaptation: the observable Markov model method and the formal concept analysis method.

(Magyar Kutatók 8. Nemzetközi Szimpóziuma / 8th International Symposium of Hungarian Researchers on Computational Intelligence and Informatics)

2 Classification with Markov Model

2.1 N-grams and Markov Models

In statistical language processing, the N-gram model is widely used. N-gram models are essential in speech recognition, handwriting recognition, machine translation, spelling correction, part-of-speech tagging, natural language generation and any task where words have to be identified from noisy, ambiguous input. Our goal is to compute the probability of a word w given some history h, i.e. P(w|h). A sequence of N words is represented as follows [2]:

w_1, ..., w_n, abbreviated w_1^n (1)

The probability of an entire sequence such as P(w_1^n) can be computed by decomposition, using the chain rule of probability [1]:

P(w_1^n) = P(w_1) P(w_2|w_1) P(w_3|w_1^2) ... P(w_n|w_1^{n-1}) = ∏_{k=1}^{n} P(w_k|w_1^{k-1}) (2)

The chain rule shows that the joint probability of a sequence can be computed from the probabilities of words given the previous words.
But using the chain rule does not really help us, because there is no known way to compute the probability of a word given a long sequence of preceding words. Estimation by counting occurrences of a word following a long sequence of words is not viable either, because language is creative and can produce sequences never seen before. Thus, instead of computing the probability of a word given its entire history, we approximate it by just a few preceding words. The bigram model is an N-gram model where the probability of a word is approximated by the preceding word alone [2]:

P(w_n|w_1^{n-1}) ≈ P(w_n|w_{n-1}) (3)

whereas in the trigram model this probability is

P(w_n|w_1^{n-1}) ≈ P(w_n|w_{n-2}^{n-1}) (4)

In general, we can make the following approximation:

P(w_n|w_1^{n-1}) ≈ P(w_n|w_{n-N+1}^{n-1}) (5)

The property that the probability of a state depends only on the previous state is called the Markov property, stated as [3]:

P(ξ_{t+1} = s_{i_{t+1}} | ξ_t = s_{i_t}, ..., ξ_1 = s_{i_1}) = P(ξ_{t+1} = s_{i_{t+1}} | ξ_t = s_{i_t}) (6)

where ξ_1, ..., ξ_n are random variables. In a Markov chain, we can attribute to each state a finite set of signals. After each transition, one of the signals associated with the current state is emitted. Thus, we can introduce a new sequence of random variables η_t, t = 1...T, the signal emitted at time t. This determines a Markov model. In a Markov model we define [3]:
• a finite set of states Ω = {s_1, ..., s_n};
• a signal alphabet Σ = {σ_1, ..., σ_m};
• an n × n state transition matrix P = [p_ij], where p_ij = P(ξ_{t+1} = s_j | ξ_t = s_i);
• an n × m signal matrix A = [a_ij], where a_ij = P(η_t = σ_j | ξ_t = s_i);
• an initial vector v = [v_1, ..., v_n], where v_i = P(ξ_1 = s_i).
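As a toy illustration (the helper below is our own sketch, not the paper's implementation), the transition probabilities p_ij can be estimated from bigram counts with the maximum-likelihood estimate C(w_{n-1} w_n) / C(w_{n-1}):

```python
from collections import Counter

# Sketch (our own): maximum-likelihood bigram transition probabilities
# over characters, P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}).
def bigram_probs(words):
    prev_counts, pair_counts = Counter(), Counter()
    for w in words:
        chars = ["^"] + list(w)            # "^" marks the word start
        prev_counts.update(chars[:-1])     # count only preceding positions
        pair_counts.update(zip(chars, chars[1:]))
    return {pair: c / prev_counts[pair[0]] for pair, c in pair_counts.items()}

P = bigram_probs(["ab", "ac", "ab"])
print(P[("a", "b")])   # C("ab") / C("a") = 2/3
```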
At every point in time, a state transition occurs depending on the transition probabilities, and after each transition one of the signals associated with the current state is emitted. In general, a state in the model depends on the previous state; this is called a first-order Markov model. When a state depends on several preceding states, the model is called a higher-order Markov model: second-order, ..., N-order. The bigram model is a first-order Markov model, the trigram model is a second-order one, and the n-gram model is an (n−1)-order Markov model.

2.2 Algorithm for Classification

Rule detection can be defined as a classification method in which words that are inflected in the same way belong to the same class. The classes are not predefined; they are extracted from the training samples, which consist of word pairs of a stem and its correctly inflected form. A class can be defined as

C_{x,y} = {(w, w') | ∃ w_0 ∈ W: w = w_0.x, w' = w_0.y} (7)

where "." denotes concatenation. The number of classes depends on the training samples. For each class, an observable Markov model is generated. The states of the model are the unigrams, bigrams, trigrams, etc. of the stems. Choosing the right order of Markov model is an optimization problem: the higher the order, the more accurate the result can be, but the more costly the algorithm. For example, with an alphabet of 35 characters, a first-order MM has at most 35 states, a second-order MM at most 35^2, and an N-order MM at most 35^N. The number of cells in the transition matrix is the square of the number of states, so the computational cost depends strongly on a good choice of model order.
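The state sets described above can be sketched as follows (a hypothetical helper of our own): a stem is broken into overlapping n-grams, and the upper bound on the number of states grows exponentially with the model order.

```python
# Sketch (our own helper): the n-gram states of a stem for an
# order-(n-1) observable Markov model.
def ngrams(word, n):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(ngrams("alma", 2))        # ['al', 'lm', 'ma']

# Upper bound on the number of states for a 35-character alphabet:
for n in (1, 2, 3):
    print(n, 35 ** n)           # 35, 1225, 42875
```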
The training algorithm is the following:
1. Get the first pair from the samples.
2. Determine the class from the word pair.
3. If the class does not exist, establish the class and its model.
4. Add the stem to the model of the current class:
   a) break the word into n-grams (n is a predefined constant);
   b) add the n-grams as states to the model;
   c) recalculate the values in the transition matrix.
5. Get the next pair from the samples and go to step 2.
The number of states varies from model to model because, like the number of classes, it depends on the training samples. The algorithm is simple, so the cost of training is much lower than for more complex networks or for FCA, although it depends on the order of the model. In the testing phase, a stem is given to the classifier as input, and a class, i.e. an inflection rule, is returned as output. The classifier works with observable Markov models, since the output of the process is the set of states at each instant of time, where each state corresponds to a physical (observable) event [4]. The stem has to be broken into n-grams, which form the observation sequence O = {w_1, w_2, w_3, ..., w_n}. We wish to determine the probability of O given the models. We have a model for each class, so in a first approach we can compute the probability P(O|model_i) for every class:

P(O|model_i) = P(w_1) · P(w_2|w_1) · P(w_3|w_2) · ... · P(w_n|w_{n-1}) (8)

The maximum probability identifies the winning model:

P_max = max_i {P(O|model_i)} (9)

The algorithm can be refined by eliminating the classes that do not fit the test stem w_T. The candidate classes are

C_T = {C_{x,y} | ∃ w_0 ∈ W: w_T = w_0.x} (10)
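The testing phase can be sketched as follows (the model layout and the "^" start marker are our own assumptions, not the paper's): each class model stores bigram transition probabilities over n-gram states, unseen transitions score zero, and the class maximizing P(O|model_i) wins.

```python
# Hedged sketch of the testing phase: each class model maps state
# pairs to transition probabilities; unseen transitions score 0.
def score(model, O):
    """P(O|model) = P(w1) * prod_k P(w_k | w_{k-1})."""
    p = model.get(("^", O[0]), 0.0)        # "^" = start state, gives P(w1)
    for prev, cur in zip(O, O[1:]):
        p *= model.get((prev, cur), 0.0)
    return p

def classify(models, O):
    """Return the class with the maximum P(O|model_i), as in Eq. (9)."""
    return max(models, key=lambda c: score(models[c], O))

models = {
    "rule_a": {("^", "do"): 1.0, ("do", "og"): 1.0},
    "rule_b": {("^", "bo"): 1.0, ("bo", "ox"): 1.0},
}
print(classify(models, ["do", "og"]))   # rule_a
```

The candidate filtering of Eq. (10) would simply restrict the dictionary passed to classify to the classes whose suffix x matches the test stem.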


Publication date: 2007